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Abstract 

We consider the problem of recovering a low-rank tensor from its noisy observation. Previ¬ 
ous work has shown a recovery guarantee with signal to noise ratio ()(rJ K ^/ 2 ) f or recovering 
a K th order rank one tensor of size n x • • • x n by recursive unfolding. In this paper, we first 
improve this bound to 0(n A / 4 ) by a much simpler approach, but with a more careful analysis. 
Then we propose a new norm called the subspace norm, which is based on the Kronecker prod¬ 
ucts of factors obtained by the proposed simple estimator. The imposed Kronecker structure 
allows us to show a nearly ideal 0{y/n + VH K ~ l ) bound, in which the parameter H controls 
the blend from the non-convex estimator to mode-wise nuclear norm minimization. Further¬ 
more, we empirically demonstrate that the subspace norm achieves the nearly ideal denoising 
performance even with H = 0(1). 


1 Introduction 

Tensor is a natural way to express higher order interactions for a variety of data and tensor decom¬ 
position has been successfully applied to wide areas ranging from chemometrics, signal processing, 
to neuroimaging; see [15, 18] for a survey. Moreover, recently it has become an active area in the 
context of learning latent variable models [3]. 

Many problems related to tensors, such as, finding the rank, or a best rank-one approaximation 
of a tensor is known to be NP hard [11, 8]. Nevertheless we can address statistical problems, 
such as, how well we can recover a low-rank tensor from its randomly corrupted version (tensor 
denoising) or from partial observations (tensor completion). Since we can convert a tensor into a 
matrix by an operation known as unfolding, recent work [25, 19, 20, 13] has shown that we do get 
nontrivial guarantees by using some norms or singular value decompositions. More specifically, 
Richard & Montanari [20] has shown that when a rank-one A'th order tensor of size n x • • • x n 
is corrupted by standard Gaussian noise, a nontrivial bound can be shown with high probability 
if the signal to noise ratio fd/cr "f w ) K / 2 1 / 2 by a method called the recursive unfolding 1 . Note 

1 We say a n ^ b n if there is a constant C > 0 such that a n > C ■ b n . 
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that /3/a /z a fn is sufficient for matrices (K = 2) and also for tensors if we use the best rank- 
one approximation (which is known to be NP hard) as an estimator. On the other hand, Jain & 
Oh [13] analyzed the tensor completion problem and proposed an algorithm that requires 0(n 3//2 ■ 
polylog(n)) samples for K = 3; while information theoretically we need at least Q(n) samples and 
the intractable maximum likelihood estimator would require O (n-poly log (n)) samples. Therefore, 
in both settings, there is a wide gap between the ideal estimator and current polynomial time 
algorithms. A subtle question that we will address in this paper is whether we need to unfold 
the tensor so that the resulting matrix become as square as possible, which was the reasoning 
underlying both [19, 20]. 

As a parallel development, non-convex estimators based on alternating minimization or non¬ 
linear optimization [1, 21] have been widely applied and have performed very well when appro¬ 
priately set up. Therefore it would be of fundamental importance to connect the wisdom of non- 
convex estimators with the more theoretically motivated estimators that recently emerged. 

In this paper, we explore such a connection by defining a new norm based on Kronecker prod¬ 
ucts of factors that can be obtained by simple mode-wise singular value decomposition (SVD) of 
unfoldings (see notation section below), also known as the higher-order singular value decompo¬ 
sition (HOSVD) [6, 7]. We first study the non-asymptotic behavior of the leading singular vector 
from the ordinary (rectangular) unfolding X and show a nontrivial bound for signal to noise 
ratio /3/a ^ n K ^. Thus the result also applies to odd order tensors confirming a conjecture in 
[20]. Furthermore, this motivates us to use the solution of mode-wise truncated SVDs to construct 
a new norm. We propose the subspace norm, which predicts an unknown low-rank tensor as a 
mixture of K low-rank tensors, in which each term takes the form 

fold<8> • • • ® P (fc_1) <8) P ik+1) <8> • • • ® P <A) ) t ), 

where fold fc is the inverse of unfolding (•)(*.), (8) denotes the Kronecker product, and P 6 M nx H 
is a orthonormal matrix estimated from the mode-fc unfolding of the observed tensor, for k = 
1,..., K; H is a user-defined parameter, and M (l ' } e M. nxH . Our theory tells us that with 

- (fc) 

sufficiently high signal-to-noise ratio the estimated P spans the true factors. 

We highlight our contributions below: 

1. We prove that the required signal-to-noise ratio for recovering a /v'th order rank one tensor from 
the ordinary unfolding is (){n K i l ). Our analysis shows a curious two phase behavior: with high 
probability, when n K ^ A !3/a A n K t 2 , the error shows a fast decay as 1//3 4 ; for [3/a >z n K ^ 2 , the 
error decays slowly as 1 / (3 2 . We confirm this in a numerical simulation. 

2. The proposed subspace norm is an interpolation between the intractable estimators that directly 
control the rank (e.g., HOSVD) and the tractable norm-based estimators. It becomes equivalent to 
the latent trace norm [23] when H = n at the cost of increased signal-to-noise ratio threshold (see 
Table 1). 

3. The proposed estimator is more efficient than previously proposed norm based estimators, 
because the size of the SVD required in the algorithm is reduced from n x n A_1 to n x H K ~ 1 . 

4. We also empirically demonstrate that the proposed subspace norm performs nearly optimally 
for constant order H. 
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Table 1: Comparison of required signal-to-noise ratio j3/o of different algorithms for recovering 
a A'th order rank one tensor of size n x • • • x n contaminated by Gaussian noise with Standard 
deviation a. See model (2). The bound for the ordinary unfolding is shown in Corollary 1. The 
bound for the subspace norm is shown in Theorem 2. The ideal estimator is proven in Appendix 
A. 


Overlapped/ 
Latent nuclear 
norm[23] 

Recursive 
unfolding [ 20 ]/ 
square norm[19] 

Ordinary 

unfolding 

Subspace norm (pro¬ 
posed) 

Ideal 

0 ( n (ir-!)/ 2 ) 

0 (nWl/ 2 ) 

0(n K / 4 ) 


0(^nKlog(K)) 


Notation 

Let X G f n i XTi 2x-xnK b e a A'th order tensor. We will often use n\ — ■ ■ ■ — tik = n to simplify 
the notation but all the results in this paper generalizes to general dimensions. The inner product 
between a pair of tensors is defined as the inner products of them as vectors; i.e., (X. W) = 
(vec(A),vec(W)). For u G M ni ,v G M. n2 ,w G M™ 3 , uo v o w denotes the rii x n 2 x n 3 
rank-one tensor whose i,j,k entry is UiVjWk . The rank of X is the minimum number of rank-one 
tensors required to write A as a linear combination of them. A mode-/;: fiber of tensor X is an n k 
dimensional vector that is obtained by fixing all but the kth index of X. The mode-/,: unfolding 
Xfj.j of tensor X is an n k x W k ,^ k n u matrix constructed by concatenating all the mode-/;- fibers 
along columns. We denote the spectral and Frobenius norms for matrices by || • || and || • || F , 
respectively. 

2 The power of ordinary unfolding 

2.1 A perturbation bound for the left singular vector 

We first establish a bound on recovering the left singular vector of a rank-one n x m matrix (with 
m > n) perturbed by random Gaussian noise. 

Consider the following model known as the information plus noise model [4]: 

X = /3uv t + oE , ( 1 ) 

where u and v are unit vectors, 3 is the signal strength, a is the noise standard deviation, and 
the noise matrix E is assumed to be random with entries sampled i.i.d. from the standard normal 
distribution. Our goal is to lower-bound the correlation between u and the top left singular vector 
ii of X for signal-to-noise ratio /3/cr £3 (mn ) 1 / 4 with high probability. 

A direct application of the classic Wedin perturbation theorem [28] to the rectangular matrix 
X does not provide us the desired result. This is because it requires the signal to noise ratio 
f3/cr > 2||_E7||. Since the spectral norm of E scales as O p (y/n + y/m) [27], this would mean that 
we require (3/cr ^3 m 1//2 ; i.e., the threshold is dominated by the number of columns m, if m > n. 


3 









Alternatively, we can view u as the leading eigenvector of XX , a square matrix. Our key 
insight is that we can decompose XX as follows: 


XX = (f3 2 uu ' + ma 2 I) + (a EE ' - ma 2 I) + fia{uv 1 E 1 + Evu 1 ). 


Note that u is the leading eigenvector of the first term because adding an identity matrix does 
not change the eigenvectors. Moreover, we notice that there are two noise terms: the first term 
is a centered Wishart matrix and it is independent of the signal !3\ the second term is Gaussian 
distributed and depends on the signal (3. 

This implies a two-phase behavior corresponding to either the Wishart or the Gaussian noise 
term being dominant, depending on the value of (3. Interestingly, we get a different speed of con¬ 
vergence for each of these phases as we show in the next theorem (the proof is given in Appendix 
D.l). 


Theorem 1 . There exists a constant C such that with probability at least 1 — 4e 71 , ifm/n > C, 


(u,u)\ > 


Cnm ,— (3 i 

, if \Jm > — > (Own) 4 , 
a 


OW 2 

otherwise, |(w,u)| > 1 - jpfa 2 iffi/o > VCn. 


°n . B 

, if - > s/m, 


o 


In other words, if X has sufficiently many more columns than rows, as the signal to noise 
ratio f3/o increases, ii first converges to u as 1 /(3 4 , and then as 1 /(3 2 . Figure 1(a) illustrates these 
results. We randomly generate a rank-one 100 x 10000 matrix perturbed by Gaussian noise, and 
measure the distance between ii and u. The phase transition happens at (3/cr = (ran) 1 / 4 , and there 
are two regimes of different convergence rates as Theorem 1 predicts. 


2.2 Tensor Unfolding 

Now let’s apply the above result to the tensor version of information plus noise model studied by 
[20]. We consider a rank one n x ■ ■ • x n tensor (signal) contaminated by Gaussian noise as follows: 

y = X* + cr£ = o ■ ■ ■ o + a£, (2) 

where factors u (k ' ) 6 M", k = 1,.... K, are unit vectors, which are not necessarily identical, and 
the entries of £ G M nx "' xn are i.i.d samples from the normal distribution Af(0,1). Note that this is 
slightly more general (and easier to analyze) than the symmetric setting studied by [ 20 ] . 

Several estimators for recovering X* from its noisy version y have been proposed (see Table 
1). Both the overlapped nuclear norm and latent nuclear norm discussed in [23] achives the relative 
performance guarantee 

I* - **||U/3 < O p (oV^/(3) , (3) 
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n = 100 , m = 10000 




(a) Synthetic experiment showing phase transition at 
fjo = (nm) 1 ' 4 and regimes with different rates of 
convergence. See Theorem 1. 


(b) Synthetic experiment showing phase transition at 
P = O'dlfc ftfc) 1 / 4 for odd order tensors. See Corol¬ 
lary 1. 


Figure 1: Numerical demonstration of Theorem 1 and Corollary 1. 

where X is the estimator. This bound implies that if we want to obtain relative error smaller than 
e, we need the signal to noise ratio /3/a to scale as (3/a \Jn K ~ x /e. 

Mu et al. [19] proposed the square norm, defined as the nuclear norm of the matrix obtained 
by grouping the first [_K/2\ indices along the rows and the last \K/2] indices along the columns. 
This norm improves the right hand side of inequality (3) to O p {as/ni K ^ 2 ^ / f3), which translates to 
requiring (3/a ^ V'nJ K 3P je for obtaining relative error e. The intuition here is the more square 
the unfolding is the better the bound becomes. However, there is no improvement for K = 3. 

Richard and Montanari [20] studied the (symmetric version of) model (2) and proved that a 
recursive unfolding algorithm achieves the factor recovery error dist (u^ k \u^) = e with (3 /a 
\Zn^ K / 2 ^ /e with high probability, where dist (it, it 7 ) := min(||w — u'\\, ||it + it'll). They also 
showed that the randomly initialized tensor power method [7, 16, 3] can achieve the same error e 
with slightly worse threshold (3 /a ^3 max ( x/n/c 2 , n K 3 2 ) sjK log K also with high probability. 

The reasoning underlying both [19] and [20] is that square unfolding is better. However, if we 
take the (ordinary) mode-fc unfolding 

Y ( fc ) = /W fc) (it (fc_1) (8) • • • <gm (1) <g) v/ K) ® ® it (fe+1) ) T + aE {k) , (4) 

we can see (4) as an instance of information plus noise model (1) where m/n = n h 2 . Thus the 
ordinary unfolding satisfies the condition of Theorem 1 for n or K large enough. 

Corollary 1. Consider a K{> 3)th order rank one tensor contaminated by Gaussian noise as in 
(2). There exists a constant C such that ifn K ~ 2 > C, with probability at least 1 — 4A'e -n , we have 

if n 2 >f3/a>C^n^, 

for k — 1,..., K, 

if fi/(? > 

where u lln is the leading left singular vector of the rectangular unfolding Y/p. 


K 


dist 2 {u (k) . 


u 


{k)\ 


< 


2 Cn 
2 Cn 


((3/a\ 


|2 ’ 
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This proves that as conjectured by [20], the threshold /3 jo Z n h,/4 applies not only to the even 
order case but also to the odd order case. Note that Hopkins et al. [10] have shown a similar 
result without the sharp rate of convergence. The above corollary easily extends to more general 

711 x ■ ■ ■ x n k tensor by replacing the conditions by > P/ a ^ i C Y\k=i n k ) 1/4 and 

/3/a > ri£. The result also holds when X* has rank higher than 1; see Appendix E. 

We demonstrate this result in Figure 1(b). The models behind the experiment are slightly more 
general ones in which [n l5 n 2 , n 3 ] = [20,40, 60] or [40, 80,120] and the signal X* is rank two with 
/3i = 20 and /3 2 = 10. The plot shows the inner products (w ] 1 , u ] 1 ) and (u^\ ) as a measure 

of the quality of estimating the two mode-1 factors. The horizontal axis is the normalized noise 
standard deviation a{J^ =l n/,.) I,/4 . We can clearly see that the inner product decays symmetrically 
around (3\ and /3 2 as predicted by Corollary 1 for both tensors. 


3 Subspace norm for tensors 

Suppose the true tensor X* G M nx "' xn admits a minimum Tucker decomposition [26] of rank 


\>* _ y-'-R y-*R P, 

^ ~ 2^iu=i''' 2^i K =i P- 


.(!) 


(K) 


0 ‘ ‘ ‘ 0 Ui K 


(5) 


If the core tensor C = (f3i i ...i K ) G M i?x "' xR is superdiagonal , the above decomposition reduces to 
the canonical polyadic (CP) decomposition [9, 15]. The mode-/,: unfolding of the true tensor X* 
can be written as follows: 


x* [k) = u w c (k , (u (l) 


(fc) 


L/ (fe_1) <g) U [k+1) 


JJW 


( 6 ) 


where Cm is the mode-/,: unfolding of the core tensor C; U i, ' } is a nxR matrix = [u\ k 1 ,..., u 
for k — 1,..., Ii. Note that U (l ' ] is not necessarily orthogonal. 

Let X\ k) = P {k) A (fc) Q (fc)T be the SVD of X* (k) . We will observe that 


(fc)i 
R J 


Q {k) G Span fp (1) 


p( fc ~!) (g) p( fc +!) 


pi K ) 


(7) 


because of ( 6 ) and U^ G Span (P^). 

Corollary 1 shows that the left singular vectors P u ' ] can be recovered under mild conditions; 
thus the span of the right singular vectors can also be recovered. Inspired by this, we define a 
norm that models a tensor A" as a mixture of tensors Z i v / ..., Z ik ] . We require that the mode- 
k unfolding of Z^ k \ i.e. has a low rank factorization = M/^S^ T , where G 

M.nxH K 1 | s a variable, and S {k> G W lK ' y Hl< 1 is a fixed arbitrary orthonormal basis of some 
subspace, which we choose later to have the Kronecker structure in (7). 

In the following, we define the subspace norm, suggest an approach to construct the right factor 
S lk> , and prove the denoising bound in the end. 
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3.1 The subspace norm 

Consider a A'th order tensor of size n x • • • n. 

Definition 1. Let S {1 \ ... ,S (h) be matrices such that G MT K lxRK 1 with H < n. The 
subspace norm for a Kth order tensor X associated with {S (k) }£ =1 is defined as 

inf {M ( fc )}K i Ef=r \\ M(k) \\*, e Span({S {k) }f =1 ), 

+oo, otherwise , 

where ||-||* is the nuclear norm, and Span({S^}{ L x ) := {X G M nx "' xn : ... ,M^ K \X = 

Enfold fc (M«S (fc ) T )}. 

In the next lemma (proven in Appendix D.2), we show the dual norm of the subspace norm 
has a simple appealing form. As we see in Theorem 2, it avoids the 0(Vn^~^) scaling (see the 
first column of Table 1) by restricting the influence of the noise term in the subspace defined by 


W*)S w ||, 

where || • || is the spectral norm. 

3.2 Choosing the subspace 

A natural question that arises is how to choose the matrices S , S^ k \ 

Lemma 2. Let the X* {k) = P (fc) A (fc) Q (fc) be the SVD of X* {k) , where P (fc) is nx R and Q (k) is 
n K 1 x R. Assume that R < n and U u ' ] has full column rank. It holds that for all k, 

i) U {k) G Span{pW), 

ii) Q {k) G Span (P (1) ® • • • ® P (fc “ 1} ® P (fc+1) ® • • • ® P w ). 

Proof. We prove the lemma in Appendix D.4. □ 

Corollary 1 shows that when the signal to noise ratio is high enough, we can recover P‘ h) with 
high probability. Hence we suggest the following three-step approach for tensor denoising: 

(i) For each k, unfold the observation tensor in mode k and compute the top H left singular 

— (k) 

vectors. Concatenate these vectors to obtain a n x H matrix P . 

(ii) Construct S [k) as S {k) = P (1) ® • • • ® P ^ ® P (k+1) ® • • • ® P (A) . 


Lemma 1. The dual norm of ||| • ||| s is a semi-norm 

III A III « = max 

s k=l,...,K 
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(iii) Solve the subspace norm regularized minimization problem 

mm \ly-Xf F + \\\Xl„ (8) 

where the subspace norm is associated with the above defined 
See Appendix B for details. 


3.3 Analysis 

Let y G M nx "' xn be a tensor corrupted by Gaussian noise with standard deviation a as follows: 

y = X* + aS. (9) 


We define a slightly modified estimator X as follows: 


A 


argmin[^|||y- Xf F + A|||*|| a : ^ 


£>ld fc (M (fc) S (fc)T ) , {M (fc) }f =1 G A4(p)} 

fc=i 


GO) 


where M.(p) is a restriction of the set of matrices M ^ G M. nxHR 1 , k = 1,..., K defined as 
follows: 

M(fi) := {{M% : ||fold k (M<*>) w || < ), Vfc # <}. 

This restriction makes sure that M^ k \ k — 1,, K, are incoherent, i.e., each M U ' J has a spectral 
norm that is as low as a random matrix when unfolded at a different mode t. Similar assumptions 
were used in low-rank plus sparse matrix decomposition [ 2 , 12 ] and for the denoising bound for 
the latent nuclear norm [23]. 

Then we have the following statement (we prove this in Appendix D.3). 


Theorem 2. Let X p be any tensor that can be expressed as 

X P — fold* 

fc=i 

which satisfies the above incoherence condition } k=1 G A4 (p) and let r k be the rank ofM^fi 

for k — 1,... , K. In addition, we assume that each S' 1 ' 1 is constructed as S ik> = P ( * 8 ■ ■ • 8 

^ (fc+l) ^ (k) -r- ^ (k) 

P with (P ) P — I H - Then there are universal constants c () and C\ such that any solution 
X of the minimization problem (10) with X = |A p —A*||| g »-|-cocr (^/n + s/H K ~ 1 + a/ 2 log(A/5)^ 
satisfies the following bound 

IA ~~ ^*1/ < 1 Xp ~ 1 r k ; 

with probability at least 1 — 5. 
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Note that the right-hand side of the bound consists of two terms. The first term is the ap¬ 
proximation error. This term will be zero if X* lies in Span({S" / ^}('l| ). This is the case, if we 
choose S (k) = I n K~ i as in the latent nuclear norm, or if the condition of Corollary 1 is satisfied 
for the smallest /3r when we use the Kronecker product construction we proposed. Note that the 
regularization constant A should also scale with the dual subspace norm of the residual X p — X*. 

The second term is the estimation error with respect to X p . If we take X p to be the orthogonal 
projection of X* to the Span({S' (fc ^}^ 1 ), we can ignore the contribution of the residual to A, 
because (X p — X*)(k)S (k> = 0. Then the estimation error scales mildly with the dimensions n, 
H k ~ 1 and with the sum of the ranks. Note that if we take S ik ’ = I u k-i, we have H K ~ l = n A-1 , 
and we recover the guarantee (3) . 


4 Experiments 

In this section, we conduct tensor denoising experiments on synthetic and real datasets, to numer¬ 
ically confirm our analysis in previous sections. 

4.1 Synthetic data 

We randomly generated the true rank two tensor X* of size 20 x 30 x 40 with singular values 
f3i = 20 and /3 2 — 10. The true factors are generated as random matrices with orthonormal 
columns. The observation tensor y is then generated by adding Gaussian noise with standard 
deviation a to X*. 

Our approach is compared to the CP decomposition, the overlapped approach, and the latent 
approach. The CP decomposition is computed by the tensorlab [22] with 20 random initializations. 
We assume CP knows the true rank is 2. For the subspace norm, we use Algorithm 2 described in 

~(fc) 

Section 3. We also select the top 2 singular vectors when constructing U ’s. We computed the 
solutions for 20 values of regularization parameter A logarithmically spaced between 1 and 100. 
For the overlapped and the latent norm, we use ADMM described in [25]; we also computed 20 
solutions with the same A’s used for the subspace norm. 

We measure the performance in the relative error defined as \\X — A’*||| i? /||| A’*||| F . We report 
the minimum error obtained by choosing the optimal regularization parameter or the optimal ini¬ 
tialization. Although the regularization parameter could be selected by leaving out some entries 
and measuring the error on these entries, we will not go into tensor completion here for the sake of 
simplicity. 

Figure 2 (a) and (b) show the result of this experiment. The left panel shows the relative 
error for 3 representative values of A for the subspace norm. The black dash-dotted line shows 
the minimum error across all the A’s. The magenta dashed line shows the error corresponding 
to the theoretically motivated choice A = cr(max fc (y / nfe + V H K ~ l ) + a/ 2 log(JT)) for each a. 
The two vertical lines are thresholds of a from Corollary 1 corresponding to /?i and /3 2 , namely, 
/VOL™*) 174 ar| d /V0L^) 1/4 - ft confirms that there is a rather sharp increase in the error 
around the theoretically predicted places (see also Figure 1(b)). We can also see that the optimal A 
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(a) 


(b) 


(c) 


Figure 2: Tensor denoising. (a) The subspace approach with three representative A’s on synthetic 
data, (b) Comparison of different methods on synthetic data, (c) Comparison on amino acids data. 

should grow linearly with a. For large a (small SNR), the best relative error is 1 since the optimal 
choice of the regularization parameter A leads to predicting with X = 0. 

Figure 2 (b) compares the performance of the subspace norm to other approaches. For each 
method the smallest error corresponding to the optimal choice of the regularization parameter A is 
shown. In addition, to place the numbers in context, we plot the line corresponding to 


Relative error 


V R Hk n k log(A~) . ^ 


( 11 ) 


which we call “optimistic”. This can be motivated from considering the (non-tractable) maximum 
likelihood estimator for CP decomposition (see Appendix A). 

Clearly, the error of CP, the subspace norm, and “optimistic” grows at the same rate, much 
slower than overlap and latent. The error of CP increases beyond 1, as no regularization is imposed 
(see Appendix C for more experiments). We can see that both CP and the subspace norm are 
behaving near optimally in this setting, although such behavior is guaranteed for the subspace 
norm whereas it is hard to give any such guarantee for the CP decomposition based on nonlinear 
optimization. 


4.2 Amino acids data 

The amino acid dataset [5] is a semi-realistic dataset commonly used as a benchmark for low rank 
tensor modeling. It consists of five laboratory-made samples, each one contains different amounts 
of tyrosine, tryptophan and phenylalanine. The spectrum of their excitation wavelength (250-300 
nm) and emission (250-450 nm) are measured by fluorescence, which gives a5x201x61 tensor. 
As the true factors are known to be these three acids, this data perfectly suits the CP model. The 
true rank is fed into CP and the proposed approach as H = 3. We computed the solutions of CP for 
20 different random initializations, and the solutions of other approaches with 20 different values 
of A. For the subspace and the overlapped approach, A’s are logarithmically spaced between 10 3 
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and 10 5 . For the latent approach, A’s are logarithmically spaced between 10 4 and 10 6 . Again, we 
include the optimistic scaling (11) to put the numbers in context. 

Figure 2(c) shows the smallest relative error achieved by all methods we compare. Similar to 
the synthetic data, both CP and the subspace norm behaves near ideally, though the relative error of 
CP can be larger than 1 due to the lack of regularization. Interestingly the theoretically suggested 
scaling of the regularization parameter A is almost optimal. 


5 Conclusion 

We have settled a conjecture posed by [20] and showed that indeed 0(n K / 4 ) signal-to-noise ratio 
is sufficient also for odd order tensors. Moreover, our analysis shows an interesting two-phase 
behavior of the error. This finding lead us to the development of the proposed subspace norm. The 

•~( 1 ) ~(K) 

proposed norm is defined with respect to a set of orthonormal matrices P ,..., P , which are 
estimated by mode-wise singular value decompositions. We have analyzed the denoising perfor¬ 
mance of the proposed norm, and shown that the error can be bounded by the sum of two terms, 
which can be interpreted as an approximation error term coming from the first (non-convex) step, 
and an estimation error term coming from the second (convex) step. 


References 

[1] E. Acar, D. M. Dunlavy, T. G. Kolda, and M. Mprup. Scalable tensor factorizations for 
incomplete data. Chemometr. Intell. Lab., 106(l):41-56, 2011. 

[2] A. Agarwal, S. Negahban, and M. J. Wainwright. Noisy matrix decomposition via convex 
relaxation: Optimal rates in high dimensions. Ann. Stat., 40(2): 1171—1197, 2012. 

[3] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for 
learning latent variable models. J. Mach. Learn. Res., 15( 1):2773—2832, 2014. 

[4] F. Benaych-Georges and R. R. Nadakuditi. The eigenvalues and eigenvectors of finite, low 
rank perturbations of large random matrices. Advances in Mathematics, 227(1):494-521, 
2011 . 

[5] R. Bro. Parafac. tutorial and applications. Chemometr. Intell. Lab., 38(2): 149—171, 1997. 

[6] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposi¬ 
tion. SIAM J. Matrix Anal. Appl., 21 (4): 1253—1278, 2000. 

[7] L. De Lathauwer, B. De Moor, and J. Vandewalle. On the best rank-1 and rank- 
(/7 1 . 11-2, .... A.v) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl., 
21(4): 1324-1342, 2000. 

[8] C. J. Hillar and L.-H. Lim. Most tensor problems are np-hard. J. ACM, 60(6):45, 2013. 


11 


[9] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. J. Math. 
Phys., 6(1): 164-189, 1927. 

[10] S. B. Hopkins, J. Shi, and D. Steurer. Tensor principal component analysis via sum-of-squares 
proofs. Technical report, arXiv: 1507.03269, 2015. 

[11] J. Hastad. Tensor rank is NP-complete. Journal of Algorithms, 11(4): 644-654, 1990. 

[12] D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. 
Information Theory, IEEE Transactions on, 57(11):7221—7234, 2011. 

[13] R Jain and S. Oh. Provable tensor factorization with missing data. In Adv. Neural. Inf. 
Process. Syst. 27, pages 1431-1439, 2014. 

[14] T. G. Kolda. Orthogonal tensor decompositions. SIAM Journal on Matrix Analysis and 
Applications, 23(l):243-255, 2001. 

[15] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 
51(3):455-500, 2009. 

[16] T. G. Kolda and J. R. Mayo. Shifted power method for computing tensor eigenpairs. SIAM 
J. Matrix Anal. Appl., 32(4): 1095-1124, 2011. 

[17] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. 
Ann. Stat., pages 1302-1338, 2000. 

[18] M. Mprup. Applications of tensor (multiway array) factorizations and decompositions in data 
mining. Wiley Interdisciplinary Rev.: Data Min. Knowl. Dicov., l(l):24-40, 2011. 

[19] C. Mu, B. Huang, J. Wright, and D. Goldfarb. Square deal: Lower bounds and improved 
relaxations for tensor recovery. In Proc. ICML '14. 2014. 

[20] E. Richard and A. Montanari. A statistical model for tensor pea. In Adv. Neurcd. Inf. Process. 
Syst. 27, pages 2897-2905, 2014. 

[21] L. Sorber, M. Van Barel, and L. De Lathauwer. Optimization-based algorithms for tensor 
decompositions: Canonical polyadic decomposition, decomposition in rank-(l_r,l_r,l) terms, 
and a new generalization. SIAM Journal on Optimization, 23(2):695-720, 2013. 

[22] L. Sorber, M. Van Barel, and L. T. De Lathauwer. Tensorlab v2.0. http://www. 
tensorlab. net, 2014. 

[23] R. Tomioka and T. Suzuki. Convex tensor decomposition via structured Schatten norm regu¬ 
larization. In Adv. Neurcd. Inf. Process. Syst. 26, pages 1331-1339. 2013. 

[24] R. Tomioka and T. Suzuki. Spectral norm of random tensors. Technical report, 
arXiv: 1407.1870, 2014. 


12 



[25] R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima. Statistical performance of convex tensor 
decomposition. In Adv. Neural. Inf. Process. Syst. 24, pages 972-980. 2011. 

[26] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 
31 (3) :279—311, 1966. 

[27] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. Technical 
report, arXiv: 1011.3027, 2010. 

[28] P.-A. Wedin. Perturbation bounds in connection with singular value decomposition. BIT 
Numerical Mathematics, 12(1):99—111, 1972. 


A Maximum likelihood estimator 


Let y G M '" 1 x "' x,t,v be a noisy observed tensor generated as follows: 

R 

y = X* + cr£ = ^ f3 r u ( r v> o ■ ■ ■ o ul K) + a£, 


r =1 


where £ is a noisy tensor whose entries are i.i.d. normal JV(0,1 ). 

Let Xmle be the (intractable) estimator defined as 

X M le = argmin (}\y - X\f F : rank(,T) < R ) . 
x 

We have the following performance guarantee for Xmle'- 
Theorem 3. Let R < min^ iik/ 2. Then there is a constant c such that 


Xmlf. — X* I F < co 


A 


I< 


R K rik l °S^K/Ko) + log(2 /6), 


k =1 


with probability at least 1 — 5, where K 0 = log(3/2). 

Note that the factor R h in the square root is rather conservative. In the best case, this factor 
reduces to linear in R and this is what we present in Section 4 as “optimistic” ignoring constants 
and 5; see Eq. (11). 

Proof of Theorem 3. Since X M le is a minimizer and X * is also feasible, we have 

\\y-XuLEi\l < 

which implies 

|||.\ * — ✓TmleI^ < cr{£, Xmle ~ X*) 

X cr HI ((I_HI Amle — X* 


op" 


II nuc ’ 
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where 


II AIL := 


"op 


sup { 




x . . M (i y 2) ... M (jc) 


luWlI = ||L 2) || = ••• = ||«W|| = 1} 

is the tensor spectral norm and the nuclear norm 



r =1 


is the dual of the spectral norm. 

Since both Amt f and A* are rank at most R, the difference A M le — A* is rank at most 2 R. 
Moreover, any rank-/? CP decomposition with R < rnirifc n k can be reduced to an orthogonal CP 
decomposition with rank at most R h via the Tucker decomposition [14]. Thus, denoting this or¬ 
thogonal decomposition by A M le—A* = • ■°'^ { r K ' ) and using/3 r := ||ft^ 1 ' ) || • • • ||u^||, 

we have 


r k 


iiia mle - A*i lluc < y: d r < 

r =1 



P 


2 

r 


= Vr« IAmle - A* 


where the last equality follows because the decomposition is orthogonal. 

Finally applying the tail bound for the spectral norm ||£|| op of random Gaussian tensor £ [24], 
we obtain what we wanted. □ 


B Details of optimization 


For solving problem (8), we follow the alternating direction method of multipliers described in 
[25]. We scale the objective function in (8) by 1/A, and consider the dual problem 


min 

Api 

v,{w^} f = i 

2 hi i 

s.t. 

max 


k 

w-w 


f - (v, y) 

w (k) || < 1, 

W^=D {k) S^ k \ k = l,...,K, 


( 12 ) 


where V e fmx^x-xnK- p s the dual tensor that corresponds to the residual in the primal problem 
(8), and s are auxiliary variables introduced to make the problem equality constrained. 
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Algorithm 1: Tensor denoising via the subspace norm 

Input: noisy tensor y, subspace dimension H, regularization constant A 

for k = 1 to K do 

^ (&) 

P i — top H left singular vectors of Y (k) 

end for 

for k = 1 to K do 

s (k) < — p (1) ® • • • ® p (fc_1) ® p (A+1) ® ® p (A) 

end for 

Output: X = argmin x ||||;y - X\\ 2 F + A|||A’|| S . 


The augmented Lagrangian function of problem (12) could be written as follows: 

A A ' 

= -||x?| 2 - (v,y) + J2 {(mW,d w sw - w^) 

k =1 

+ ||| D (k) S^ - WW ||| + l||-||< 1 (^ (fc) )) ) 

where AT^’s are the multipliers, rj is the augmenting parameter, and 1 ||.||<i is the indicator func¬ 
tion of the unit spectral norm ball. 

We follow the derivation in [25] and conclude that the updates of V, M‘ k) and W u ' ] can be 
computed in closed forms. We further combine the updates of W^ k) and other steps so that it needs 
not to be explicitly computed. The sum of the products of M and S u, ) finally converges to the 
solution of the primal problem (8), see Algorithm 2. 

The update for the Lagrangian multipliers ( k = 1,..., K) is written as singular value 

soft-thresholding operator defined as 

prox* r (Z) = P rnax(S — 77, 0 )Q T , 

where Z = P£Q t is the SVD of Z. 

A notable property of the subspace norm is the computational efficiency. The update of AT^ 
requires singular value decomposition, which usually dominates the costs of computation. For 
problem (12), the size of M ^ is only x H K ~ 1 . Comparing with previous approaches, e.g. the 
latent approach whose multipliers are rik x Ylk'^k rik ' matrices, the size of our variables is much 
smaller, so the per-iteration cost is reduced. 

C Additional experiments 

We report the experimental results when the input rank of CP and the subspace approach is are 
over-specified, on the same synthetic dataset as Section 4. We consider the case where the input 
rank is 8. 
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Algorithm 2: ADMM for subspace norm minimization 

Input: y. A, ..., S {K \ r), initializations V 0 , {M^,..., M ( 0 h ' 1 } 
t = 0 

repeat 

®«+i = x^iy + KriV, - E t fold*((2Mf> - 

for k — 1 to K do 

= prc< (iW?> + ,D Wiltl S®) 

end for 

t<-t + 1 
until convergence 

Output: A = Mf' } 5 (fc)T . 


We impose the regularizations on the factors of CP. We test 20 values that are logarithmically 
spaced between 0.01 and 10 are the regularization parameter. For each value, we compute 20 
solutions with random initializations and select the one with lowest objective value. 

For the subspace approach, we computed solutions for 20 values of the regularization parameter 
that are logarithmically spaced between 1 and 1000. 

As before, we report the minimum relative error obtained by the same method. The results are 
shown in Figure 3. We include the case the rank is specified incorrectly for comparison. Clearly, 
even if the rank is much larger than the truth, the subspace approach and CP are robust with proper 
regularization. 


D Proofs 


D.l Proof of Theorem 1 

We consider the second moment of X: 

XX = (3 2 uu t + a 2 EE T + Pa(uv T E T + EvvJ) 

B 

/ - ^ 1 \ 

= f3 2 uu T + ma 2 I + 

G 

/--X 

<j 2 EE t - ma 2 I + (3a(uv T E T + Evu T ). 


The eigenvalue decomposition of B can be written as 

i? — fit w ^ + m °‘ 


ma 2 I 


1 - 

1- 

h 

s 

1 _ 

1 

bo H 

1 _ 
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Figure 3: Tensor denoising on synthetic dataset when the input rank is larger than the truth. 

We first show a deterministic lower bound for |(£i, u)| assuming (3 2 > 2||G||, where ii is the 
leading eigenvector of XX . Then we bound the spectral norm ||G|| of the noise term (Lemma 
3) and derive the sufficient condition for (3. 

Let ii be the leading eigenvector of XX with eigenvalue A, r = Bii — Xu = —Gii. We 
have Ujr = (mcr 2 — A )Uju. Hence, for all (3 2 > 2||G||, it holds that 


sin(it, -ii,) | 


\UJu\ 


WMl < IIG1I 2 ||G|| 
X-ma 2 ~ P 2 ~\\G\\ - (3 2 


where we used ||LLjV ||2 = IlG^Giil^ < ||G||, and A > u T XX T u T > (3 2 + mcr 2 — ||G 
Therefore, 


if/3 2 > 21|G 

It follows from Lemma 3 (shown below) that 


(^ U )|=|co S (n, U ,|> 1 /l-^>l - 4||G|12 




•4 — 


/ 3 4 ’ 


2 Ccr 2 s/mn, 
2Cf3a^/n, 


if (3/a < \/m , 

otherwise, 


where C is a universal constant with probability at least 1 — 4e n . 
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Now consider the first case (13/a < y/m) and assume f3 2 > A Co 2 \Jmn > 2||G||. Note that this 
case only arises when y/m > ACy/n. Denoting C = 16 C 2 , we obtain the first case in the theorem. 
Next, consider the second case ((3/o > y/m). If y/m > ACy/n as above, we have f3/a > ACy/n, 
which implies (3 2 > 211G j and we obtain the second case in the theorem. On the other hand, if 
y/rri < ACy/n, we require f3/o > ACy/n to obtain the last case in the theorem. 

Lemma 3. Let G be constructed as in Theorem 1. Ifm > n, there exists an universal constant C 
such that 

||G|| < Co 2 (y/rrm + y/ n(/3 / o) 2 ^ , 
with probability at least 1 — 4e -n . 

Proof. The proof is an e-net argument. Let 

A = 2 cr 2 (CAmn + An + \/8n(/3/o) 2 ^j . 

The goal is to control \x T Gx\ for all the vectors x on the unit Euclidean sphere S n ~ l . In order to 
do this, we first bound the probability of the tail event \x T Gx\ > A, for any fixed x e S n ~ 1 . Then 
we bound the probability that \x T Gx\ > A for all the vectors in a e-net J\f £ . Finally, we establish 
the connection between sup^^ \x ' Gx\ and ||G||. 

To bound P(|:r T Ga 3 | > A) for a fix x e 5 n_1 , we expand x Gx as 

xGx = cr 2 (||z|| 2 — m) + 2/3a(u T x)y, 

where z = E T x and 7 = v T z. Since z ~ A/"(0, 1), we can see that ||z || 2 is y 2 distributed with m 
degrees of freedom and 7 ~ A/"(0,1). 

First we bound the deviation of the y 2 term. By the corollary of Femma 1 in [17], we have 

P(|||z || 2 -ml > Ai) < 2e _4n , (13) 


where Ai = 2{y/ Amn + An). 

Next we bound the deviation of the Gaussian term. Using the Gaussian tail inequality, we have 


P(|7l>A 2 )<2e- 4r \ 


(14) 


where A 2 = \/8 n. 

Combining inequalities (22) and (14), we have 

P(|a3 T Ga:| > A) 

< P (ex 2 11|z:|| 2 — m| + 12/3cr(^ T £c)71 > a 2 Ai + 2 / 0 X 2 ) 

< P (| ||z|| 2 — m| > Ai V |y| > A 2 ) 

< P (|||z || 2 - m\ > Ai) + P (| 7 | > A 2 ) 

< Ae~ 4n , 
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where the second to last line follows from the union bound. 

Furthermore, using Lemma 5.2 and 5.4 of [27], for any e € [0,1), it holds that 

|A4| < (1 + 2 fe) n , 

and 

||G|| < (1 — 2s)” 1 sup \x t Gx\. 

xeMe 

Taking the union bound over all the vectors in Af\ m, we obtain 

P ( sup \x t Gx\ > A ] < |M /4 |4e- 4n < 4e _n . 

y*eA/'i /4 J 

Finally, the statement is obtained by noticing that n < m. □ 

We prove a more general version of the theorem that allows the signal part to be rank R in 
Appendix E. 

D.2 Proof of Lemma 1 

Proof. By definition, 

K 

IIIVIII,. = sup (y.Vfoid 

fc =1 
K 

s.t.^||M (fc) ||, < 1 

k =1 

K 

= sup y^{Y {k) S^ k \M^) 

{^ (fe) lf=i k =i 

K 

s.t.^||M (fc) ||, < 1 
k =1 

= max||L (fc) 5 (fc) ||, 

where we used the Holder inequality in the last line. □ 

D.3 Proof of Theorem 2 

First we decompose the error as 

III** - X\ F < III** - X P \\ F + \X P - x\\ F . 

The first term is an approximation error that depends on the choice of the subspace S (h 'K The 
second term corresponds to an estimation error and we analyze the second term below. 


19 


Since X is the minimizer of (10) and X p is feasible, 


K 


my - xf F +\ y iiAf w n. < Any - x r f F +a iim«h„ 


K 


k =1 


k =1 


from which we have 
1 


K 


x r - xf F < iiv- x p \\,.\\x r - xi + \Y( Ill’ll- - 


r (*). 


( 15 ) 


k =1 


Next we define A k := M ( ^ E W lkXRK 1 and define its orthogonal decomposition 

A fc = A' fc + A" as 


K 


■ n k 


U F 


)A AI 


h k-i 


V p ), 


where P[/ p and P Vp are projection matrices to the column and row spaces of M^\ respectively, 
and A' fc := A/, - A". 

~ (^) 

The above definition allows us to decompose \\M ||* as follows: 


Moreover, 


= ||Mf + AJ + A'J. 

> l|AfW||, + IIAJII, - ||A'J„ 


K K 

- xi < Y IIA t .||, < Y (II'^11. + IKii.) 

k =1 k= 1 


( 16 ) 


(17) 


Combining inequalities (15)—(17), we have 


1 

2 




<dl y-x r 


K 


a) E n^fcii*+ 


- X, 


'Pills* 


k =1 


K 


a) Eliz¬ 


as) 


Since 


\iy-X p \l s ,<al\S\\ s , + \lX*-X 1 


Pills* 5 


if A > cr||£||| s * + \\X* — A’plg,, the second term in the right-hand side of inequality (18) can be 
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ignored and we have 


1 

2 




K 


<2A£||A'J, 

k= 1 


K 

<2A^ v / 2^||A / fc || F 

k =1 
I< 

< 2A \/2r , fc|| Afe||_F 

k =1 


< 2\/2A, 


ft: 

E 




ft' 

E 

fc=i 




(19) 


where in the second line we used a simple observation that rank(A / fc ) < 2 /y. 

Next, we relate the norm \X p — X\\ F to the sum Y2k=i 11A /, 11 2 F in the right-hand side of in¬ 
equality (19). 

First suppose that Ylk =i II A fc |||i < \X p — X^ F . Then from inequality (19), we have 


K 


I a; - x\ F < aV2\ 




22 rk 


by dividing both sides by \\\X p — X\\ F . 

On the other hand, if \\X p — X\\ 2 F < Y2k=i I A/,,|| we use the following lemma 

Lemma 4. Suppose {M^}f =v {M^}^ =1 <G M(p), and S (A:) is constructed as a Kronecker 
product of K—l ortho-normal matrices P as S - = P , where (P ) P = 

I H fort = 1,..., K. Then for X p = £f =1 fold* ^M ( p k) S (k)T ^j and X = £f =1 fold fc (M (fc) S (fc)T ), 
the following inequality holds: 


i K i A ,_ K 

- 22 l|Ar-||| < ~JX P - X\ 2 f + pm&x(y/nf + \ / H K ~ 1 ) 22 ||A 


k *■ 


k =i 

Proof. The proof is presented in Section D.5. 
Combining inequalities (18) and (20), we have 

K 


k= 1 


K 


22 II^IIf ^ (l^ _ ^pIL* +pmax(v^ + V + Aj 22 II A'* 


II* 


k =1 


k =1 


ft: 


III3 7 - A’pl,. +pmax(y^ + yf H K ~ l ) - A ) ^ ||A 


//ii 


k =1 


( 20 ) 

O 
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Thus if we take A > cr|||£||| s , + \X* — X p \ s * + pmax fc ( % /n^ + \fH K ~ 1 ), the second term in the 
right-hand side can be ignored and following the derivation leading to inequality (19) and dividing 

both sides by \jY^ k = i || A/||, we have 


| X p — X\ F < 


K 


* J]||A fc |||<4v/2A, 

\ k =1 


K 


\ X>*> 

\ k =1 


where the first inequality follows from the assumption. 

The final step of the proof is to bound the norm |||P||/ with sufficiently high probability. By 
Lemma 1, 


= max p( fc) SW||. 

Therefore, taking the union bound, we have 

K 

P (max ||-E(fc)S (fc) || > t) < J^P (||£ (fc) S (A:) || > t 


k =1 


( 21 ) 


Now since each Ei k )S^ G W lkXHh ' is a random matrix with i.i.d. standard Gaussian entries, 

P ^||^( fc )5 (fc) || > yjvf + Vh k ~ 1 + tj < exp(—t 2 /(2cr 2 )). 


Therefore, choosing t = max fc ( 1 /n^ + V // A_1 ) + a/2 \og(K/S) in inequality (21), we have 
max ||-E(fc)S (fc) || < max( A /n~k + VII K / + yj2 \og{K/8), 

k k 

with probability at least 1 — 5. Plugging this into the condition for the regularization parameter A, 
we obtain what we wanted. 


D.4 Proof of Lemma 2 

Proof, i) Let ® k 'e[K]\kU {k,) denote t/ (1) ® • • • ® U {k ~ l) ® C/ (fc+1) ® • • • ® U {K) . We have 

X\ k) = U^C (k) (®k'e[K]\kU^y 

= u^c (k) (<g> feWV /L/ (fc,) ) T ) 

= pWaW(qW) t . 

Because of the minimality of the Tucker decomposition (5), X* k y C{k ) and U are all of 
rank R, for all k G [K], Therefore, both C( k ) and &k'e[K]\k(U (k / T have full row rank. 
Hence, C( k ) has a Moore-Penrose pseudo inverse C| fe) such that C( k )C^ = I, and so does 
G) fc / e [x]\fc(C/ (fc, / T . As a result, we have 

u(k) = pw A W( Q W) t (®,, emVfe (c/( fe, )) T ) t c; fc) . 
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ii) Similarly, we have 


Q(fc) A (fc)(pW ) T = (® k , e[K]Xk U^ Cj k) (U^) T . 

By the definition of SVD, A is invertible and = /. Hence, 

Q (k) = cJ k) (uW) T pW( A^)- 1 . 

This means G span j and we then conclude Q^O G span 

using (i). 

□ 


D.5 Proof of Lemma 4 


Expanding X p and X, we have 

\x p - *|& = IIEJLi fold fc (A,5 (fc)T ) \f F 

K 


II Afclll. + £)<fold fc (A k S^ T ), fold,(A,5^ T )) 


k=1 




ll^fcllf + y~l(foldfc(A fc ) Xfc/^fc \ fold^(A^) P l j ) 


.(O, 


fc=i 

/v 




Sl|A*llF + Z)<fold*(Afc) x,P W fold,(A,) x fc P w ) 




fc=i 

/V 






;(*) 


/ H ) T ,P vv (fold,(A,)) (jt) ) 


M 


k= 1 
ir 




>En A *-E ||Afe||* ■ ||(fold^(A^)) (fc) || 

fc=i k^e 

rr k 

> ^ ll A fc||| - 2p max( A /nfc + \ / H K - 1 ) ^ ll A fcll*> 


fc=i 


fc=i 


from which the lemma holds. Here we regarded fold fc (A fc S^) as a Tucker decomposition with 

^ (&') 

the core tensor fold*; (A*,.) and factor matrices P for k' ^ k. Most of the factors except for k 
and i cancel out when calculating the inner product between two such tensors in the third line, 

because (P ) P = I H . After unfolding the inner product at the /;:th mode in the fifth line, 
we notice that a multiplication by an ortho-normal matrix does not affect the nuclear norm or the 
spectral norm. In the last line we used {A/Jj'L, G A4(2p), which follows from the assumption 


that both {M W }Li 6 M(p). 


□ 
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E Generalization of Theorem 1 to the higher rank case 

Theorem 4. Suppose that X = (3 r u r vJ, where ui, ..., ur G M" and Vi ,..., v R G 
are unit orthogonal vectors respectively. Let X = X + aE be the noisy observation of X. There 
exists an universal constant C such that with probability at least 1 — 3e~ n , ifm/n > C{P\ //3r ) 4 , 
then 


cos(?7, U)\ > 


Cmn 

Cn(fi\/ (Sr) 2 
{Pr/<?) 2 

Cn(Pi/p R ) 2 u 



■ r P\ ^ Pr 

if < cr < 


, if cr < 


m (Cmn 

Pi 


1 ? 
I 4 


m 


otherwise, \ cos(U,U)\ >1 - 'p' 1 ' if a < pf,/(Cn) 2 fj { . 

(Pr/pY 

Suppose that X = P 2 u r vJ and X = X + aE. We consider the second moment of X: 


R 


R 


XX = Y Pr-UruJ + a [Y Pr{u r vjE T + Ev r uJ) + a 2 EE T 


r =1 


. r= 1 


B 


R 


Y Pr U r u J- + 771 ° 2 I + 


r =1 


G 


R 


a 2 EE T — mo 2 1 + a J Y^ Pr ( u r v JE T + Ev, r uJ ) 


, i =1 


The eigenvalue decomposition of B can be written as 

B = [U U,' " S + m,T2J 


ma 2 I 


1 - 

\U T ] 

1_ 

1 

bO_| 

1- 


where U G M nxi? and E = diag(/3f,... ,P R ). Similarly, the eigenvalue decomposition of XX 
can be written as 

= [u u 2 \ 

where E = diag(Ai,..., A R ) and E' = diag(A R+ i,..., A n ) s.t. Ai > ■ ■ ■ > X n are the eigenvalues 
ofXX T . 

We first show a deterministic lower bound for | sin(47. U) assuming /3r > 2||G||. Then we 
bound the spectral norm ||G|| of the noise term (Lemma 3) and derive the sufficient condition for 


S' 


u 

ul 
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The maximum singular value of mo 2 1 is mo 2 . The minimum singular value of £ is |Ar|. By 
Wely’s theorem, ||G|| > \Xr — (5 2 R — mcr 2 |, which means 

A R > mo 2 + P\ ||G||. 

Let R = GU. Since 6j { > 2||G||, we can apply the Wedin theorem and obtain 


sm(U,U)\ = \\U^U\\< 


\R\ 


\GU\ 


< 


\G\ 


P 2 n-\\G\\ Pl-\G\\ ~ Pi-\\G 


<m 

- P 2 r 


where we used the property that the spectral norm is sub-multiplicative and \\U\\ 
to last step. 

Therefore, 


cos(17, U)\ > 




1 in the second 


if P R > 21|G?||. It follows from Lemma 5 (shown below) that with probability at least 1 — 3e n 


G\\< 


2Co 2 ^/mn, 

2Ccry/n(3i, 


if Pi jo < yjm, 

otherwise, 


where C is an universal constant. 

Let C = 16 C 2 . Now consider the first situation where m/n > C(fii/P r) a . If o > -^=, we 
have ||G|| < y(Cmn)i Meanwhile, if o < — r , then we have f3% > o 2 (Cmn > 211 G 11. 
Combining these two conditions we obtain the first case in the theorem. When o < ^=, we can see 

that ||G|| < ^(Cn)^P\. Moreover, since m/n > C(/3i/Pr) 4 , it is implied that a < Pn/{Cn)hp 1 
and thus P\ > 2||G||. This gives us the second case. 

On the other hand, if m/n < C(Pi/Pr) 4 , we require o < p 2 R /(Cn)zPi to obtain the last case 
in the theorem. 


Lemma 5. Let G be constructed as in the proof of Theorem 4. If m > n, there exists an universal 
constant C such that 

||G|| < Co 2 (yjmn + y/nPi/o) , 
with probability at least 1 — 3e _n . 

Proof The proof is an £-net argument. Let 


A = 2a 2 



The goal is to control \x T Gx\ for all the vectors x on the unit Euclidean sphere S n '. In order to 
do this, we first bound the probability of the tail event \x ' Gx\ > A, for any fixed x e S n ~ 1 . Then 
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we bound the probability that \x T Gx\ > A for all the vectors in a e-net J\f £ . Finally, we establish 
the connection between sup x&v - e \x ' Gx\ and ||G||. 

To bound P(|cc T Ga;| > A) for a fix x e <S” _1 , we expand x Gx as 

R 

xGx = cr 2 (|| 2|| 2 — m) + 2a'y^ j f3 r ^ r (uj x), 

r =1 


where 2 = E J x and y r = vjz. It is easy to see that ~ J\f( 0,1), 2 ~ Af(0,1) and || 2|| 2 is y 2 
distributed with m degrees of freedom. 

Let 7 = [ 71 ,, 7 R ] and u> = [u] r x ,..., u^x]. We have 


la; Gx\ <cr 2 || 2 1|" — m + 2a 


R 

E 

r= 1 


Prlriujx) 


R 

<a 2 11| 2 1| 2 — ml + 2cr > max |/3 r | • l 7 r | • lujx 

. ^ re[B| 

r =1 


<a 2 11|21| 2 — m| + 2crf3i ■ ||7|| ■ ||cj 

<a 2 || 2|| 2 —m +2a/3 1 -|| 7 ||, 


where we used the Cauchy-Schwarz inequality in the second to last line and the fact ||cj| < 1 in the 
last line. Note that 71 ,..., 7 r are i.i.d standard Gaussian distributed so that || 7|| 2 is y 2 distributed 
with R degrees. 

First we bound the deviation of the xh term. By the corollary of Lemma 1 in [17], we have 


P(||| 2|| 2 - ml > Ai) < 2e _4n , 


( 22 ) 


where Ai = 2{\/Amn + An). 

Next we bound the x 2 r term. Similarly, we have 


P(l | 7 || 2 — -R > A 2 ) < e _4n , 


(23) 


where A 2 = 2{y/ARn + An). 

Combining inequalities (22) and (23), we have 


P(|a; T Ga;| > A) 

< P ^(T 2 ||| 2 || 2 - m\ + 2cr/?i|| 7 || > a 2 Ai + 2 cr/3 1 -y / R+ A 2 j 

< P ^11| 2 1| 2 - m\ > Ai V ||7|| > \JR + X 2 S j 

< P (111 2 11 2 — '/7T. | >A 1 )+p(||7|| > ^/R + A 2 ) 

< 3e _4n , 

where the second to last line follows from the union bound. 
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Furthermore, using Lemma 5.2 and 5.4 of [27], for any e G [0,1), it holds that 

|A4| < (1 + 2 /e) n , 

and 

||G|| < (1 — 2e)~ l sup \x t Gx\. 

xeAfs 

Taking the union bound over all the vectors in A/ 1 / 4 , we obtain 

P(||G|| < 2A) < P ( sup \x t Gx\ > X j < |A/"i /4 |3e _4n < 3 e - n . 

Finally, the statement is obtained by noticing that n < m and R < n. □ 
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